The Evolution of Large Language Model Architectures: From BERT to GPT and T5

The Transformer Architecture Triangle

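The evolution of large language models is a paradigm shift: away from task-specific models and toward "unified pre-training", in which a single architecture adapts to a wide range of natural language processing (NLP) needs.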

At the core of this shift is the self-attention mechanism, which lets the model weight the importance of every word in a sentence relative to every other word:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
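The formula can be made concrete with a few lines of NumPy. The following is a minimal sketch, not a production implementation: it handles a single head with no batching, and the function names and the random 4x8 inputs are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum first for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K: (seq_len, d_k); V: (seq_len, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # pairwise similarity, scaled by sqrt(d_k)
    weights = softmax(scores, axis=-1)   # each row is a distribution over positions
    return weights @ V, weights          # weighted sum of the value vectors

# Toy input: 4 tokens with 8-dimensional projections (illustrative values only).
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape)           # (4, 8)
print(weights.sum(axis=-1))   # every row of attention weights sums to 1
```

Each row of `weights` says how much one token attends to every other token, which is the "importance weighting" described above.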

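1. Encoder-Only (BERT)

Mechanism: Masked language modeling (MLM).
Behavior: Bidirectional context; the model "sees" the whole sentence at once in order to predict the hidden words.
Best suited for: Natural language understanding (NLU), sentiment analysis, and named entity recognition (NER).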
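To illustrate how masked language modeling builds its training examples, here is a simplified sketch in plain Python. It replaces a random subset of tokens with a `[MASK]` placeholder and keeps the originals as prediction targets. The helper name `mask_tokens`, the whitespace tokenization, and the 30% rate used in the demo are assumptions for illustration; the original BERT recipe selects about 15% of tokens and sometimes keeps or randomly replaces a selected token instead of masking it, which this sketch omits.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels); labels are None where nothing is predicted."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)  # hide the token from the model
            labels.append(tok)         # the model must recover it from both sides
        else:
            masked.append(tok)
            labels.append(None)        # no loss is computed at this position
    return masked, labels

tokens = "the cat sat on the mat".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

Because the masked positions can sit anywhere in the sentence, the model has to use context to the left and to the right of each gap, which is what makes the encoder bidirectional.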

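2. Decoder-Only (GPT)

Mechanism: Autoregressive modeling.
Behavior: Left-to-right sequential processing; the next token is predicted from the preceding context only (causal masking).
Best suited for: Natural language generation (NLG) and creative writing. This is the foundation of current large language models such as GPT-4 and Llama 3.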
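Causal masking can be sketched with a lower-triangular matrix applied to the attention scores before the softmax. The example below is a minimal NumPy illustration; the function names and the all-zero score matrix are assumptions chosen so the effect of the mask is easy to read.

```python
import numpy as np

def causal_mask(seq_len):
    # True where attention is allowed: position i may look at positions j <= i.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def causal_attention_weights(scores):
    seq_len = scores.shape[-1]
    # Future positions get -inf so they receive exactly zero weight after softmax.
    masked = np.where(causal_mask(seq_len), scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)
    e = np.exp(masked)
    return e / e.sum(axis=-1, keepdims=True)

# Uniform raw scores: every allowed position ends up with equal weight.
scores = np.zeros((4, 4))
weights = causal_attention_weights(scores)
print(weights)
```

With uniform scores, the first token attends only to itself, the second splits its weight evenly over the first two positions, and so on; nothing above the diagonal is ever non-zero, so no token can see the future.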

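3. Encoder-Decoder (T5)

Mechanism: Text-to-Text Transfer Transformer.
Behavior: The encoder maps the input string to a dense representation, and the decoder generates the target string.
Best suited for: Translation, summarization, and similar sequence-to-sequence tasks.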
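The text-to-text idea is that every task is expressed as an input string and a target string, with the task itself named in a prefix. The sketch below only shows that framing and does not load a model. The prefix strings follow the style used in the T5 paper, while the dictionary keys and the helper `to_text_to_text` are hypothetical names introduced here.

```python
# Task prefixes in the style used by T5; the dictionary keys are illustrative.
TASK_PREFIXES = {
    "translate_en_de": "translate English to German: ",
    "summarize": "summarize: ",
}

def to_text_to_text(task, text):
    """Turn a (task, input) pair into the single string the encoder would read."""
    return TASK_PREFIXES[task] + text

examples = [
    ("translate_en_de", "That is good."),
    ("summarize", "The encoder reads the input and the decoder writes the output."),
]
for task, text in examples:
    print(to_text_to_text(task, text))
```

Since both the input and the output are plain strings, the same model, loss, and decoding procedure can serve translation, summarization, and classification alike.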

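Key Insight: The Dominance of the Decoder

The industry has largely converged on decoder-only architectures, because they scale predictably and show emergent reasoning abilities in zero-shot settings.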

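VRAM Impact of Context Window Size

In decoder-only models, the KV cache grows linearly with sequence length. A 100K-token context window needs far more VRAM than an 8K window, so deploying long-context models locally is difficult without quantization.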
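A back-of-the-envelope estimate makes the linear growth concrete. The cache stores one key and one value vector per token, per layer, per KV head. The configuration below (32 layers, 32 KV heads, head dimension 128, 16-bit values) is a hypothetical example, not the specification of any particular model, and the estimate ignores model weights and activations.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_value=2, batch_size=1):
    # Factor 2 accounts for storing both keys and values.
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)

# Hypothetical model configuration (assumed for illustration only).
config = dict(n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_value=2)

GIB = 1024 ** 3
small = kv_cache_bytes(seq_len=8_000, **config)
large = kv_cache_bytes(seq_len=100_000, **config)

print(f"8K context:   {small / GIB:.2f} GiB")   # 3.91 GiB
print(f"100K context: {large / GIB:.2f} GiB")   # 48.83 GiB
print(f"ratio: {large / small}")                # 12.5
```

Under these assumptions the cache alone costs about 0.5 MiB per token, so the 100K window needs 12.5 times the memory of the 8K window. Storing the cache in 8 bits (`bytes_per_value=1`) would halve both figures, which is one reason quantization matters for local deployment.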

Question 1

Why did the industry move from BERT-style encoders to GPT-style decoders for Large Language Models?

Decoders scale more effectively for generative tasks and instruction following via next-token prediction.

Encoders cannot process text bidirectionally.

Decoders require less training data for classification tasks.

Encoders are incompatible with the Self-Attention mechanism.

Question 2

Which architecture treats every NLP task as a "text-to-text" problem?

Encoder-Only (BERT)

Decoder-Only (GPT)

Encoder-Decoder (T5)

Recurrent Neural Networks (RNN)

Challenge: Architectural Bottlenecks

Analyze deployment constraints based on architecture.

If you are building a model for real-time document summarization where the input is very long, explain why a Decoder-only model might be preferred over an Encoder-Decoder model in modern deployments.

Step 1

Identify the architectural bottleneck regarding context processing.

Solution:
Encoder-Decoders must process the entire long input through the encoder, then perform cross-attention in the decoder, which can be computationally heavy and complex to optimize for extremely long sequences. Decoder-only models pass the input and the generated output through one uniform stack of causal self-attention layers. With modern techniques like FlashAttention and KV cache optimization, scaling the context window in a Decoder-only model is more streamlined and efficient for real-time generation.

Step 2

Justify the preference using Scaling Laws.

Solution:
Decoder-only models have demonstrated highly predictable performance improvements (Scaling Laws) when increasing parameters and training data. This massive scale unlocks "emergent abilities," allowing a single Decoder-only model to perform zero-shot summarization highly effectively without needing the task-specific fine-tuning often required by smaller Encoder-Decoder setups.